ABSTRACT


Digital encoding of speech to allow more efficient
transmission at low data rates involves the decomposition
of the speech waveform into various parameters which are
related to the physical structure of the speech production
process. In this thesis, linear predictive coding is used
to produce a set of coefficients for the characteristic
polynomial of successive 25 msec. segments of the voice
track, in the z-domain. The location of the poles in the
z-plane and the excitation pitch period are then shifted
and the signal reformulated to cause changes of the overall
frequency characteristics of the speech waveform, while
maintaining the perceived sounds and information content.
The resulting audio tapes confirm the theory and
conjectures of the thesis.
Digital processing of speech signals has become
important and necessary with the introduction of high-speed 
digital devices into every phase of communication: place to 
place; man to machine; and machine to man. 

Digital signals have a number of inherent advantages 
over analog signals. Digital signals may be coded for 
security or for noise immunity. A digital voice signal may
be transmitted by the same equipment used for data and it
may be multiplexed with that data. One of the primary
disadvantages of the digital transmission of voice is the
large bandwidth required with some digital techniques.
When analog techniques, such as single side-band amplitude
modulation, produced bandwidths of 5 kHz and the best digital
system bandwidth was 64 kHz, there was a very strong
tendency to stay with the analog techniques.

However, recent advances in digital signal processing 
have made the digital transmission of voice highly 
efficient. Until recently digital transmission of speech
was possible only by sampling the voice waveform at a
sufficiently high rate and then performing an
analog-to-digital conversion of each sample. A sufficient
number of bits were transmitted for each sample to
reconstruct the waveform at the receiver. The
voice waveform must be sampled at approximately 8,000
samples per second to avoid the loss of clarity. Each of
the samples must then be converted to a 6-10 bit number for
transmission. The overall data rate using these methods had
a lower limit in the neighborhood of 48,000 bits per
second.
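
The lower limit quoted above follows from multiplying the sampling rate by the number of bits per sample. The short sketch below only reproduces that arithmetic; the function name is illustrative and is not taken from the thesis programs.

```python
def pcm_bit_rate(samples_per_second: int, bits_per_sample: int) -> int:
    """Data rate, in bits per second, of straightforward sample-by-sample coding."""
    return samples_per_second * bits_per_sample


# 8,000 samples per second at the 6-bit and 10-bit ends of the range in the text.
lowest_rate = pcm_bit_rate(8000, 6)
highest_rate = pcm_bit_rate(8000, 10)
print(lowest_rate, highest_rate)
```

The 6-bit case gives the 48,000 bits per second figure; the 10-bit case shows how quickly the rate grows with finer quantization.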

Recent developments have allowed the voice pattern to 
be broken down into more basic parameters which are closely 
associated with the physical production of speech. These
parameters vary rather slowly and can be transmitted at a 
lower rate. Data rates as low as 1200 bits per second have 
been achieved through the use of these techniques. 

These methods are numerical representations of the 
physical production of speech, and therefore it is easier 
to alter the characteristics of speech by altering the 
associated parameters than by trying to alter the waveform
directly.

This thesis reviews various digital speech processing 
techniques for use in a speech modification system. Linear 
predictive coding (LPC) was chosen for implementation and
therefore the theory and practice of this technique are
explained in detail. The desired modification of the
speech waveform by shifting the poles of its characteristic
polynomial, and the regeneration of the altered waveform
are discussed and the implementation techniques explained.
The IBM 360 computer was used for simulating the techniques
developed. This simulation is covered in detail and the
computer programs, with results, are provided.





II. SPEECH PRODUCTION AND CHARACTERISTICS


Any digital system for altering speech characteristics
must be based on knowledge of those characteristics and the
physical structure which determines them.


A. SPEECH CHARACTERISTICS

All speech can be broken down into a set of distinctive
sounds called phonemes. In the case of American English,
there are generally considered to be 42 distinct phonemes
which are classified into vowels, diphthongs, semivowels
and consonants. Spoken communication is accomplished
through various combinations of these sounds and the
accurate reproduction of each is a major criterion in
judging voice processing systems. Phonemes are generated
at a rate of about ten per second. Each phoneme is
classified as voiced if vocal cord vibration is the source
of the sound or unvoiced if the sound is produced by other
means. If the characteristics of a phoneme change from
start to finish, the phoneme is called noncontinuant. Those
phonemes which are stationary are called continuant.

The lowest frequency present in a given voiced sound is
called the pitch frequency. The peaks in the spectral
representation of a speech sound that lie above the pitch
frequency are called formants and are numbered
consecutively with increasing frequency. Although two
speakers may produce the same phoneme, the pitch and
formant frequencies may be different. However, general
relationships may be established between pitch and formant
frequencies which are relatively constant from speaker to
speaker producing the same phoneme. If information is to
be retained by a speech processing system, it must be able
to reproduce at the output the pitch and formant frequency
relationship which was present at the input.


B. PHYSICAL SPEECH PRODUCTION STRUCTURE

The vocal tract is a resonant tube with the vocal cords
at one end and the lips at the other. The vocal tract acts
as a frequency selective filter which has a transfer
function that depends on how it is shaped at any given
time.





(A) VOICED          (B) UNVOICED
FIGURE 1. SOUND PRODUCTION


The input to the vocal tract is caused by either the
vibration of the vocal cords at the lower end (figure 1.a)
or by the turbulence of air being forced through a
constriction at any of a number of locations along the
vocal tract (figure 1.b). The vocal tract acts as a filter
with a pulsed input from the vocal cords when producing
voiced sounds such as 'a' or 'o'. During sounds caused by
the forcing of air through a constriction, fricative sounds
like 's' or 'f', the vocal tract acts as a resonant cavity
which will have certain characteristic response
frequencies. Typical waveforms for voiced and unvoiced
sounds are shown in figure 2.


VOICED

UNVOICED
FIGURE 2. TYPICAL WAVEFORMS


Certain characteristics of the vocal tract are changed
several times per second to produce different sounds while
others such as overall length and the diameter range limits
are fixed for a given speaker. A detailed look at each of
the types of sounds will ensure that the digital processor
used has the same flexibility as the actual speaker.

Vowels, voiced continuant sounds, are produced when
the vocal cords vibrate causing pulses of air at the bottom
of the vocal tract. The shape of the vocal tract remains
fixed during vowel production, acting as a stationary
filter to respond to the forcing function.

The production of diphthongs and semivowels is similar
to that of vowels except that the shape of the vocal tract 
is smoothly changed during voicing. Diphthongs and 
semivowels are noncontinuant, voiced sounds. 

The phonemes classified as consonants may actually be
further divided into subcategories of voiced fricatives,
unvoiced fricatives, stops and nasals. Fricatives are
caused by the steady flow of air through a constriction in
the vocal tract which causes turbulent air motion and a
seemingly random air pressure pattern. Fricatives are
voiced or unvoiced depending on whether the vocal cords are
producing pressure pulses at the same time. Stops or
plosives are caused by completely closing the vocal tract
and then suddenly opening it to quickly start sound
production. A stop is classified as voiced or unvoiced
depending on the nature of the sound that follows the
opening of the vocal tract. Nasals are voiced sounds which
are formed when the vocal tract is closed and air is
allowed to pass through the nasal cavity. This acts as a
feed forward path for the sound and a corresponding change
is caused in the total vocal tract response.


C. INFORMATION CONTENT


One of the primary goals of speech processing is the
development of efficient codes for transmitting or storing
speech and still allowing it to be reconstructed without
excessive loss of information. The source coding theorem
states that through the proper choice of coding we can code
a source into a bit sequence arbitrarily close in length to
the entropy of that source. However, efficient codes are
difficult to find for even simple binary sources, let alone
a continuous speech source. An estimation of the entropy of
a typical speech source provides a useful gauge for
measuring the data rate performance of any system.
If 'excessive loss of information' occurs only when we
don't receive the correct one of the 42 phonemes, the
information content of one second of speech is
approximately (assuming 10 phonemes are produced per
second):

\[ H = 10 \sum_{i=1}^{42} P(p_i) \, \bigl( -\log_2 P(p_i) \bigr) \]

where $P(p_i)$ is the probability of the ith phoneme. Assuming
further that each phoneme is equally likely,

\[ H = 10 \times 42 \times \frac{1}{42} \times \log_2 42 = 54 \text{ bits per second} \]

If the actual probability of each phoneme were used, i.e.
they are not equally likely, the value of entropy would be
significantly lower.


If 'excessive loss of information' also includes
failure to identify the speaker and failure to indicate the
speaker's emotional state, the information content is
higher. However, if we assume that identification of the
speaker (one of about two billion) is only required once
per minute and that the speaker's emotional state (say one
of ten) can only change once per second, the entropy is
still only 58 bits per second.

\[ H(\text{speaker}) = \frac{1}{60} \times 2 \times 10^{9} \times \frac{1}{2 \times 10^{9}} \times \left( -\log_2 \frac{1}{2 \times 10^{9}} \right) = 0.5 \]
\[ H(\text{emotion}) = 10 \times \frac{1}{10} \times \left( -\log_2 \frac{1}{10} \right) = 3.3 \]
\[ H(\text{phoneme}) = 54 \text{ bits per second} \]
\[ H(\text{total}) = 58 \text{ bits per second} \]

Clearly the theoretical limit is not being pushed by the
current state of the art in speech coding.
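
The entropy figures in this section can be checked numerically. The sketch below assumes, as the text does, that every outcome in each category is equally likely, so each term reduces to an event rate multiplied by the base-2 logarithm of the number of outcomes. The function name is illustrative only.

```python
import math


def uniform_entropy_rate(num_outcomes: int, events_per_second: float) -> float:
    """Bits per second for a source that emits equally likely outcomes."""
    return events_per_second * math.log2(num_outcomes)


h_phoneme = uniform_entropy_rate(42, 10)                  # 10 phonemes per second
h_speaker = uniform_entropy_rate(2_000_000_000, 1 / 60)   # identified once per minute
h_emotion = uniform_entropy_rate(10, 1)                   # may change once per second
h_total = h_phoneme + h_speaker + h_emotion
print(round(h_phoneme, 1), round(h_speaker, 1), round(h_emotion, 1), round(h_total))
```

Rounded, the three terms are about 54, 0.5 and 3.3 bits per second, which sum to about 58 bits per second.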
III. DIGITAL SPEECH PROCESSING TECHNIQUES


Digital speech processing techniques may be placed into
three general categories based on the assumptions used in
their development. The first category is that of waveform
techniques where the only primary assumption is that the
signal which is being processed is frequency limited to no
more than half of the sampling frequency. The second
category of spectral methods adds the assumption that the
frequency domain characteristics of the speech waveform
vary slowly. Finally, the voice tract parameter techniques
assume that the physical voice production system can be
modeled digitally.


A. WAVEFORM METHODS 

Waveform techniques have the characteristic of
operating equally well on any low-pass filtered waveform
and all are generally based on the familiar pulse code
modulation. The basic requirements of a waveform
quantization method are that the waveform be sampled at
greater than twice the highest frequency present and that
the samples be quantized into a digital code for
transmission. Although this technique is very
straightforward, it also requires a high data rate. A waveform
sampled 9600 times per second with each sample quantized to
256 levels would require 76,800 bits per second for
transmission. A number of variations (differential
modulation and adaptive differential modulation) have been
used to reduce the required data rate but have failed to
cut the required data rate by more than about half.


B. SPECTRAL TECHNIQUES
1. Short Term Frequency Analysis

These methods deal with the short-term frequency
properties of the speech signal. An early spectral method
was the channel vocoder. The transmitting processor of the
channel vocoder consists of a bank of narrow-band analog
filters. The energy passed by each filter is measured and
transmitted to the receiver site. It is also determined
whether the input speech was voiced or unvoiced and that
determination is transmitted. In the receiver an
excitation signal, determined by the voicing decision, is
fed into a bank of narrow-band filters, each of which has
an adjustable gain determined by the received energy
measurements.

The same technique can be implemented in an all
digital method by replacing the bank of analog filters with
digital filters or by performing a discrete Fourier
transformation (DFT) on a frame of input samples. The use
of the DFT is usually preferred because of computational
efficiency and the availability of high-speed DFT array
processors. Normally each input frame is windowed to
reduce the noise which can be caused by a sharp cutoff at
the end of a frame. When this method is used to reduce the
data rate required for digital transmission, the total DFT
of each frame is not transmitted because the total DFT
would require the same number of bits as the frame of
samples (assuming both are quantized to the same number of
levels). Reduction in the data rate can be accomplished by
skipping frames and assuming they are duplicates of the
preceding frame during reconstruction. The number of
samples in the frame also fixes the number of frequencies
resolved by the DFT (a real-valued frame of N samples yields
about N/2 distinct frequencies), therefore the frame length for
analysis is chosen as a compromise between accuracy of
voice reproduction and the desire for a low data rate.
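
The windowing and transformation of a single frame may be sketched as follows. This is a minimal illustration, not the thesis program: the Hamming window is an assumed choice (the text does not name the window used), the direct-summation DFT is written out for clarity rather than speed, and the test frame is a synthetic sinusoid.

```python
import cmath
import math


def hamming(n_samples):
    """Hamming taper (an assumed window choice) of the given length."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def dft(x):
    """Direct O(N^2) discrete Fourier transform of a sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def frame_spectrum(frame):
    """Magnitudes of the non-redundant DFT bins of a windowed real frame."""
    window = hamming(len(frame))
    spectrum = dft([s * w for s, w in zip(frame, window)])
    return [abs(v) for v in spectrum[: len(frame) // 2 + 1]]


N = 64
# Synthetic frame: a sinusoid completing exactly 8 cycles in the frame.
frame = [math.sin(2 * math.pi * 8 * n / N) for n in range(N)]
mags = frame_spectrum(frame)
peak_bin = max(range(len(mags)), key=mags.__getitem__)
print(len(mags), peak_bin)
```

A 64-sample real frame produces 33 non-redundant bins, which illustrates why a shorter frame lowers the data rate at the cost of frequency resolution.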
This method of speech processing would lend itself
well to altering the frequency characteristics of voice
signals but it requires a relatively high data transmission
rate and therefore was not desirable for speech processing
in conjunction with place to place communications or with
digitally stored speech.

2. Homomorphic Processing 

Another method which involves frequency domain
processing is homomorphic processing. It is based on the
following three principles:

(1) Speech is the convolution of an excitation
function and the transfer function of the vocal
tract.

(2) Convolution in the time domain is equivalent to
multiplication in the frequency domain.

(3) The Fourier transform is a linear
transformation, i.e.

\[ \mathcal{F}\{x(t) + y(t)\} = \mathcal{F}\{x(t)\} + \mathcal{F}\{y(t)\} = X(\omega) + Y(\omega) \]

A method of separating a speech waveform back into these
components would help us analyze the speech. Homomorphic
processing centers around the efficient deconvolution of
these signals.

First the input signal is windowed and transformed
via the DFT, to produce the frequency domain representation
of the input speech. The time convolution of two signals is
equivalent to multiplication in the frequency domain.
However knowing the product of two waveforms does little
toward gaining knowledge of the multiplicands unless
further information is given. The multiplication of the
two values at a given frequency is equivalent to adding the
logarithms of each. The log is taken of each of the values
in the frequency domain representation of the signal which
is then equal to the sum of the log of the frequency
domain representation of the excitation function plus the
log of the frequency domain representation of the vocal
tract function. However, it is easier to tell the
difference between the vocal tract and excitation functions in
the time domain, so the inverse DFT is taken of the log of
the frequency domain function. The function produced is
called the cepstrum of the signal. Because taking the
inverse DFT is a linear function, and the frequency domain
function was the sum of two component functions, the time
domain cepstrum must also be the sum of the cepstrum of the
excitation function and the cepstrum of the vocal tract
function. Figure 3 illustrates the relationship between
the steps of homomorphic deconvolution of signals.
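
The additive property of the cepstrum can be demonstrated on a small example. The sketch below is an illustration under stated assumptions: the two short sequences are toy stand-ins for the excitation and vocal tract functions (chosen so that neither spectrum has a zero), the log is taken of the spectral magnitude, and the DFT is computed by direct summation. None of the names come from the thesis programs.

```python
import cmath
import math


def dft(x, inverse=False):
    """Direct DFT, or inverse DFT when inverse=True."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def real_cepstrum(x):
    """Inverse DFT of the log magnitude of the DFT of x."""
    log_mag = [math.log(abs(v)) for v in dft(x)]
    return [v.real for v in dft(log_mag, inverse=True)]


def circular_convolve(x, h):
    """Circular convolution of two equal-length sequences."""
    n = len(x)
    return [sum(x[j] * h[(i - j) % n] for j in range(n)) for i in range(n)]


N = 32
excitation = [1.0, 0.5] + [0.0] * (N - 2)     # toy excitation sequence
tract = [1.0, -0.3, 0.2] + [0.0] * (N - 3)    # toy vocal tract response
speech = circular_convolve(excitation, tract)

c_speech = real_cepstrum(speech)
c_sum = [a + b for a, b in zip(real_cepstrum(excitation), real_cepstrum(tract))]
max_diff = max(abs(a - b) for a, b in zip(c_speech, c_sum))
print(max_diff)
```

The largest difference between the cepstrum of the convolved signal and the sum of the two component cepstra is at the level of floating-point rounding, which is the property that the deconvolution relies upon.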

Examination of the cepstrum between 2.5 and 20
msec. may reveal a peak that is considerably above the
background noise level. If a peak is there, the segment is
determined to be voiced with the peak occurring at the pitch
period. The vocal tract is not long enough to sustain any
vibrations for more than 20 msec. after a pulsed input.
If there is no peak the segment is considered unvoiced.
The cepstrum of the excitation function may be subtracted
from the total cepstrum and the remainder considered an
estimate of the cepstrum of the vocal tract transfer
function. After working backwards to magnitude (vs. log of
magnitude) in the frequency domain, the filter coefficients
may be determined.

It would be relatively straightforward to alter
both the excitation function and the vocal tract transfer
function after the total cepstrum is broken into its
additive components. However, homomorphic processing was
not being widely used for voice communication and this
technique was dropped in favor of a more widely used
system. As array fast Fourier transform processors become
faster and less expensive, homomorphic speech processing
may become the dominant speech communication technique.
C. VOICE TRACT PARAMETER TECHNIQUES IN THE TIME DOMAIN

The primary characteristic of this category is the
close tie between the digital process and the physical
structure being modeled. Although homomorphic processing
uses the deconvolution of the vocal tract function and the
excitation function as a primary tool, the homomorphic
process does require transformations to and from the
frequency domain and therefore is not included in this
category. The primary member of this category is the linear
predictive coding (LPC) process which has shown itself to
be among the best and most versatile of the various speech
processing techniques.

1. The Speech Model

The speech model assumed and used for LPC is that
of a time-varying digital filter which is excited by a
wide-band function, either a pulsed input or random noise.
This is illustrated in figure 4. The recursive filter used
to model the vocal tract is all-pole and has slowly time
varying (pseudo-stationary) coefficients. The filter's
z-domain transfer function is

\[ Y(z) = \frac{U(z)}{1 - \sum_{i=1}^{p} a_i z^{-i}} \]

or in the discrete time-domain

\[ y(n) = u(n) + \sum_{i=1}^{p} a_i \, y(n-i) \]

From the time domain equation it is clear that the current
output y(n) is uniquely specified in terms of the current
input and the past p output values.
The vocal tract is not always best modeled by an all-pole
filter, and particularly nasal sounds would probably be
best modeled by a filter which also included zeros. However
there is considerable difficulty in rapidly estimating both
poles and zeros of a transfer function when only a short
segment of the output is available for analysis. Nevertheless,
experience has shown that high quality voice production is
possible by using an all-pole filter of adequate order.
The order of the filter required is closely related
to the length of the vocal tract. To adequately represent
the lower frequency response of the vocal tract, the filter
must include recursive delay equal to the delay encountered
by sound waves traveling from the vocal cords to the lips
and returning to the glottis.

velocity of sound = 344 m/sec

length of vocal tract = 17 cm

\[ \frac{2 \times 0.17}{344} = 0.988 \text{ msec} \]

At a sampling rate of 10 kHz at least 10 past values would
need to be included for an accurate model.
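
The relationship between vocal tract length and filter order reduces to a round-trip delay divided by the sampling interval. The sketch below assumes a vocal tract length of about 17 cm and the 344 m/sec velocity of sound; the function names are illustrative.

```python
import math

SPEED_OF_SOUND_M_PER_S = 344.0
VOCAL_TRACT_LENGTH_M = 0.17   # assumed typical length


def round_trip_delay_s(length_m, speed_m_per_s=SPEED_OF_SOUND_M_PER_S):
    """Time for sound to travel from the glottis to the lips and back."""
    return 2.0 * length_m / speed_m_per_s


def min_filter_order(length_m, sample_rate_hz):
    """Smallest whole number of past samples spanning the round-trip delay."""
    return math.ceil(round_trip_delay_s(length_m) * sample_rate_hz)


delay_ms = 1000.0 * round_trip_delay_s(VOCAL_TRACT_LENGTH_M)
order = min_filter_order(VOCAL_TRACT_LENGTH_M, 10_000)
print(round(delay_ms, 3), order)
```

The same function shows how the required order scales with sampling rate: halving the rate roughly halves the number of past values needed to span the delay.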

The excitation function for voiced sounds is
modeled by a train of pulses at the glottis. Clearly these
pulses cannot be a perfect set of impulses, but rather
must have a finite width and are likely to have a definite
shape. Rather than construct a separate filter to change
the impulses into the correct shape, additional poles are
added to the model so that the combined transfer function
may be calculated at once. Normally two additional poles are
adequate for the pulse shape model.
2. Linear Predictive Techniques

Linear predictive analysis is based on the division
of speech modeling into modeling of the excitation function
and modeling of the vocal tract transfer function. The
vocal tract is modeled by computing each sample as a
weighted linear combination of previous samples. Linear
predictive coding of speech is accomplished by filtering a
sampled speech waveform through a filter which is the
inverse of the filter which models the vocal tract. If the
filter used is the inverse of a good model of the vocal
tract, the output will be a good approximation of the
excitation function. The various properties of the
excitation function, along with the coefficients used in
the vocal tract filter are measured and transmitted as
shown in figure 5.
The received measurements are used in the decoding
process, to reconstruct the excitation function and the
filter. The process of reconstructing the speech waveform
is shown in figure 6.
The primary advantage in the use of linear
predictive coding of speech is the reduction in the data
rate required for transmission or storage. LPC systems have
been developed which require data rates from 3000 to 4800
bits per second for high quality voice communication and
rates as low as 1200 bits per second have been reported for
lower quality but understandable speech production. Highly
efficient algorithms have been developed for the encoding
and decoding of speech using the LPC technique. When
implemented in hardware with special purpose, short word
length microprocessors, the computations required for
two-way communication have been done in 65% of real time.

LPC was chosen as the method to be used for
accomplishing the desired voice characteristic
modifications. A detailed description of the theory and
modeling assumptions follows.
IV. LINEAR PREDICTION THEORY


Linear prediction is an extension of least squares
estimation. In the case of one-dimensional linear
prediction, it is more commonly labeled as time series
analysis when used by statisticians for analysis of
everything from population to the stock market.


A. THEORY

It is assumed that each sample of the discrete time
series, s(kT), as shown in figure 7 may be approximated by
a linear combination of past samples of the time series.

\[ \hat{s}(kT) = \sum_{i=1}^{m} a_i \, s((k-i)T) \]

where $\hat{s}(kT)$ is the estimated sample value, $a_i$ is the
coefficient of the sample i steps past and m is the order
of the approximation (and as we will see later the order of
the z-domain filter of the model).

[Figure 7: plot of the discrete time series s(kT)]
For a portion of the discrete time series (N samples where
N >> m), a least squares approximation of the weighting
coefficients, $a_i$, may be calculated. The estimate at each
point,

\[ \hat{s}(kT) = \sum_{i=1}^{m} a_i \, s((k-i)T) \]

is subtracted from the actual sample value and the error
for each estimate, e(kT), is given.

\[ e(kT) = s(kT) - \hat{s}(kT) \]
\[ e(kT) = s(kT) - \sum_{i=1}^{m} a_i \, s((k-i)T) \]

To minimize the error (in a least squares sense) the error
is squared and summed over all points in the region of
interest to obtain an overall error, E.

\[ E = \sum_{k=1}^{N} e^2(kT) = \sum_{k=1}^{N} \left[ s(kT) - \sum_{i=1}^{m} a_i \, s((k-i)T) \right]^2 \]

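The derivative of E with respect to each of the
coefficients, $a_j$, is taken and set equal to zero in order
to locate the minimum of E. This yields the following m
equations.

\[ \frac{\partial E}{\partial a_j} = \sum_{k=1}^{N} 2 \left[ s(kT) - \sum_{i=1}^{m} a_i \, s((k-i)T) \right] \frac{\partial}{\partial a_j} \left[ s(kT) - \sum_{i=1}^{m} a_i \, s((k-i)T) \right] = 0, \qquad 1 \le j \le m \]

however

\[ \frac{\partial}{\partial a_j} \bigl[ s(kT) \bigr] = 0 \]

and

\[ \frac{\partial}{\partial a_j} \bigl[ a_i \, s((k-i)T) \bigr] = \begin{cases} 0, & i \ne j \\ s((k-j)T), & i = j \end{cases} \]

therefore

\[ \frac{\partial E}{\partial a_j} = -2 \sum_{k=1}^{N} \left[ s(kT) - \sum_{i=1}^{m} a_i \, s((k-i)T) \right] s((k-j)T) = 0 \]

removing the constant multiplier

\[ \sum_{k=1}^{N} s(kT) \, s((k-j)T) = \sum_{k=1}^{N} \sum_{i=1}^{m} a_i \, s((k-i)T) \, s((k-j)T), \qquad 1 \le j \le m \]

changing the order of summation

\[ \sum_{k=1}^{N} s(kT) \, s((k-j)T) = \sum_{i=1}^{m} a_i \sum_{k=1}^{N} s((k-i)T) \, s((k-j)T), \qquad 1 \le j \le m \]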

Given all of the samples within the summations over N,
the above set of m equations in the m unknowns, $a_i$, can be
solved. If only the samples

\[ s(kT), \qquad 1 \le k \le N \]

are given, the set of equations above cannot be solved
because of the requirement to know the samples

\[ s((k-i)T), \qquad k - i \le 0 \]

However by windowing the samples so that all samples
outside the region of interest are zero

\[ s(kT) = 0, \qquad k \le 0 \text{ and } k > N \]

the summations over N in the set of equations above may be
replaced by the autocorrelation of the windowed samples,
R(j).

\[ R(j) = \sum_{k=1}^{N-j} s(kT) \, s((k+j)T), \qquad 0 \le j \le m \]

This assumption may be made because the number of samples,
N, is normally much greater than the order, m, of the set
of equations. Therefore relatively few samples are lost.
The window function used will not significantly alter the
samples in the center of the frame, and therefore the
resulting coefficients will be a correct approximation for
that segment. The set of linear equations may now be
written

\[ R(j) = \sum_{i=1}^{m} a_i \, R(|i-j|), \qquad 1 \le j \le m \]

These equations may now be solved for the linear predictive
coefficients, $a_i$, $1 \le i \le m$.

If the system being studied is stationary or we are
only considering a pseudo-stationary segment of the system
output, and if the order of the model is sufficiently close
to the order of the real system, future values of the
variable may be calculated recursively from previous
values. In the following section we will see how this
theory is applied to speech modeling and reconstruction.
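
The autocorrelation formulation of this section may be sketched in a few lines. The text does not say how the m equations are solved; the Levinson-Durbin recursion used below is a swapped-in, commonly used choice for equations of this symmetric form, not necessarily the method of the thesis programs. The test signal is the impulse response of a known second-order all-pole filter, no taper window is applied, and all names are illustrative.

```python
def autocorrelation(frame, order):
    """R(j) for j = 0..order, treating samples outside the frame as zero."""
    n = len(frame)
    return [sum(frame[k] * frame[k + j] for k in range(n - j))
            for j in range(order + 1)]


def levinson_durbin(r, order):
    """Solve R(j) = sum_i a_i R(|i-j|) for the coefficients a_1..a_order."""
    a = [0.0] * (order + 1)      # a[0] is unused
    err = r[0]                   # residual (prediction error) energy
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err


def lpc(frame, order):
    """Predictor coefficients and residual energy for one frame."""
    return levinson_durbin(autocorrelation(frame, order), order)


# Test signal: impulse response of s(n) = 1.3 s(n-1) - 0.4 s(n-2) + e(n).
true_a = [1.3, -0.4]
frame = []
for n in range(300):
    sample = 1.0 if n == 0 else 0.0
    for i, coeff in enumerate(true_a, start=1):
        if n - i >= 0:
            sample += coeff * frame[n - i]
    frame.append(sample)

coeffs, residual_energy = lpc(frame, 2)
print(coeffs, residual_energy)
```

Because the frame is long enough for the impulse response to decay fully, the recovered coefficients agree with the generating filter to within rounding; on real speech frames the result is only a least squares approximation.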


B. LINEAR PREDICTIVE CODING FOR VOICE ANALYSIS 
The digital model used for speech synthesis is shown in
figure 8. The discrete time excitation function is e(nT)
and the synthesized speech output is s(nT).
The vocal tract filter is assumed to be all-pole and
therefore can be represented by the z-domain equation

\[ H(z) = \frac{S(z)}{E(z)} = \frac{z^m}{\prod_{i=1}^{m} (z - z_i)} \]

where the $z_i$ are the poles of the filter.
Multiplying out the denominator and dividing both numerator
and denominator by $z^m$ yields

\[ H(z) = \frac{S(z)}{E(z)} = \frac{1}{1 - \sum_{i=1}^{m} a_i z^{-i}} \]

This z-domain equation is converted to a discrete time
domain equation as follows

\[ S(z) \left( 1 - \sum_{i=1}^{m} a_i z^{-i} \right) = E(z) \]
\[ S(z) = E(z) + \sum_{i=1}^{m} a_i z^{-i} S(z) \]
\[ s(nT) = e(nT) + \sum_{i=1}^{m} a_i \, s((n-i)T) \]

If the excitation function e(nT) equals zero for a given
sample, then this equation is similar to the first equation
in the previous section on the theory of linear prediction.
The coefficients of the z-domain filter transfer function
are equivalent to the linear prediction weighting
coefficients.

Analysis of the sampled speech waveform is used to
calculate the prediction coefficients which are then used
in an inverse filter to determine the excitation function
from the input speech. This inverse filter may be
represented as

\[ e(nT) = s(nT) - \sum_{i=1}^{m} a_i \, s((n-i)T) \]

and is constructed as shown in figure 9.
The input speech has been broken into vocal tract
characteristics determined by the prediction coefficients
and excitation signal characteristics which remain to be
determined. During the encoding process the output of the
inverse filter may also be considered an error signal
because it is the difference between the actual speech
sample and the predicted speech sample.
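
The relationship between the all-pole synthesis filter and its inverse can be illustrated directly from the two difference equations. The sketch below assumes zero initial conditions and uses an arbitrary stable pair of coefficients and a toy pulse train; the function names are illustrative and not taken from the thesis programs.

```python
def synthesis_filter(excitation, a):
    """s(n) = e(n) + sum_i a[i-1] * s(n-i), with s(n) = 0 for n < 0."""
    out = []
    for n in range(len(excitation)):
        feedback = sum(a[i - 1] * out[n - i]
                       for i in range(1, len(a) + 1) if n - i >= 0)
        out.append(excitation[n] + feedback)
    return out


def inverse_filter(speech, a):
    """e(n) = s(n) - sum_i a[i-1] * s(n-i), with s(n) = 0 for n < 0."""
    out = []
    for n in range(len(speech)):
        prediction = sum(a[i - 1] * speech[n - i]
                         for i in range(1, len(a) + 1) if n - i >= 0)
        out.append(speech[n] - prediction)
    return out


a = [1.3, -0.4]                                     # arbitrary stable coefficients
excitation = [1.0 if n % 20 == 0 else 0.0 for n in range(100)]
speech = synthesis_filter(excitation, a)
recovered = inverse_filter(speech, a)
max_error = max(abs(x - y) for x, y in zip(excitation, recovered))
print(max_error)
```

When the inverse filter uses exactly the coefficients of the synthesis filter, the error signal reproduces the excitation; with estimated coefficients the error signal is only an approximation of it.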

During voiced speech the vocal tract filter in figure 9
acts as a model for the total transfer function which is
due to the glottal pulse shape, the actual vocal tract
shape and the output reflection at the lips. Ideally during
voiced speech all of these effects are removed by the
inverse filter and the error function is a train of
impulses at the pitch frequency.

During unvoiced speech the physical excitation function
is a pseudo-random air pressure variation caused by
turbulence at a constriction somewhere along the vocal
tract. This wide-band source is filtered by the portion of
the vocal tract between the constriction and the lips. This
portion of the vocal tract will resonate at certain
characteristic frequencies but normally the number of peaks
in the frequency domain response will be fewer than for
voiced sounds because of the shorter segment of the vocal
tract in use. During encoding of unvoiced speech the output
of the inverse filter is pseudo-random because the inverse
filter cannot predict the output due to the random input.

The speech model is not complete with just the
determination of the coefficients of the vocal tract
filter. During speech reconstruction it is necessary to
know:

(1) Which excitation signal, pulses or noise, to
use.

(2) The excitation pulse period for voiced sounds.

(3) The gain multiplication factor.

Although these quantities are not necessarily determined
using linear prediction theory, they are nonetheless
required for a working speech encoding/decoding system.


During encoding, the marked difference in the error
signal for voiced and unvoiced speech can be used as the
basis for the voiced/unvoiced decision. The energy of the
error signal for voiced speech should be rather small in
comparison to the energy of the input samples. On the other
hand, during unvoiced speech the prediction is poor and
most of the energy remains after filtering. The ratio of
the average energy or root-mean-square value of the speech
samples to the similar quantity of the error signal can be
used to make the voiced/unvoiced decision. This ratio is
compared to an empirically determined threshold and the
segment is considered voiced whenever the ratio is greater
than the threshold.
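
The decision rule described above may be sketched as a comparison of two root-mean-square values. The threshold of 2.0 below is a hypothetical placeholder, since the text states only that the threshold is determined empirically; the function names are likewise illustrative.

```python
import math


def rms(samples):
    """Root-mean-square value of a sequence of samples."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))


def is_voiced(speech_frame, error_frame, threshold=2.0):
    """Voiced when rms(speech) / rms(error) exceeds the threshold.

    The default threshold is a placeholder, not a value from the thesis.
    """
    error_rms = rms(error_frame)
    if error_rms == 0.0:
        return True   # perfect prediction: treat the frame as voiced
    return rms(speech_frame) / error_rms > threshold


well_predicted = is_voiced([4.0, -4.0, 4.0, -4.0], [1.0, -1.0, 1.0, -1.0])
poorly_predicted = is_voiced([1.0, -1.0, 1.0, -1.0], [1.0, -1.0, 1.0, -1.0])
print(well_predicted, poorly_predicted)
```

In the first case the error signal retains a quarter of the input amplitude and the frame is classed as voiced; in the second the filter removes nothing and the frame is classed as unvoiced.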

The gain used during reconstruction is the amplitude
multiplier of the excitation signal at the input of the
vocal tract filter. The gain used during unvoiced speech
may be simply the root-mean-square of the error signal.
This gain coefficient is multiplied by the output of a
random number generator which produces normally distributed
numbers with a root-mean-square value of unity.

The gain of voiced speech may also be determined from
the root-mean-square value of the error signal. However
during reconstruction of voiced speech the entire energy of
the excitation signal is concentrated in a series of
impulses which should have the same root-mean-square value.
The root-mean-square value of a series of discrete-time
impulses $x_i$ with amplitude, a, and a period of p sample
intervals is approximated by

\[ \text{rms} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} x_i^2 } \]
\[ \text{rms} = \sqrt{ \frac{1}{N} \cdot \frac{N}{p} \, a^2 }, \qquad N \gg p \]
\[ \text{rms} = a \, p^{-1/2} \]


The output of a unit impulse generator should then be 
multiplied by 


eZ 
G= rms p 


to insure that the same energy is input to the vocal tract 
filter as was output by the filter during encoding. The 
above method for calculating the gain needed during 
reconstruction is based on the assumption that the 
prediction error for voiced speech is caused entirely by 
the physical excitation function of the speaker. However 
the prediction error may be increased because the vocal
tract was changing shape rapidly during the analysis frame
or because of background noise at the microphone which 
would not be removed by the inverse filter. Either of these 
would cause an unwanted gain increase during 
reconstruction. A typical voiced speech waveform and the 


error signal generated from it are shown in figure 10. 
FIGURE 10. (A) VOICED SPEECH WAVEFORM; (B) ERROR SIGNAL WAVEFORM
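
The gain relation for voiced excitation can be checked numerically. The following is a minimal sketch and not the thesis program; the function names and the sample values (an error rms of 2.5 and a pitch period of 80 samples) are illustrative assumptions.

```python
import math

def rms(x):
    """Root-mean-square value of a sequence."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def voiced_gain(error_rms, pitch_period):
    """Impulse amplitude G = rms * sqrt(p) that gives an impulse train
    the same rms value as the measured error signal."""
    return error_rms * math.sqrt(pitch_period)

def impulse_train(amplitude, pitch_period, n_samples):
    """One impulse of the given amplitude every pitch_period samples."""
    return [amplitude if n % pitch_period == 0 else 0.0
            for n in range(n_samples)]

# Hypothetical values: error rms of 2.5, pitch period of 80 samples.
train = impulse_train(voiced_gain(2.5, 80), 80, 8000)
print(rms(train))
```

When the number of samples is an exact multiple of the pitch period, as here, the rms of the impulse train equals the error rms; otherwise the match is approximate, which is the N >> p condition.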


The reliable determination of the pitch period of 
voiced speech is a problem for which the ideal solution is
still undetermined. The periodic increase in the amplitude
of the error signal at the pitch period is shown in figure
10(b) and suggests the use of the error signal in pitch
period determination. A number of algorithms exist for
determination of the pitch period which generally involve
various combinations of the following processes.


(1) Raising the error signal to a given power.
(2) Low-pass filtering of the error signal.
(3) Windowing the error signal.
(4) Calculating the autocorrelation function of the
filtered error signal.
(5) Picking the peaks of the autocorrelation
function.


Experience has shown that pitch determination is
computationally as difficult as the LPC parameter
determination and the literature on the subject illustrates
the trade-off between hardware, software, computation time
and reliability from method to method.


C. LPC COMMUNICATION SYSTEMS 

A review of existing LPC communication hardware is 
useful because any method which alters formant and pitch 
characteristics of speech will be most successful if it is 
compatible with these systems.

Currently off-the-shelf microprocessors are not fast
enough to handle the algorithms described in real-time.
However special purpose units, which are designed along
computer lines, do meet the real-time criteria. On the
surface the word 'computer' might not seem to fit these
special purpose machines, but a closer look will reveal
that each has components which are the same as those of a
computer: stored programming, memory, input, output, an
arithmetic logic unit (ALU), an instruction set, and
control components. Two processors which were developed at
MIT's Lincoln Laboratory will be used to illustrate the
state of the art in LPC voice terminals and certain
similarities in their architecture will be evident. The
first processor is the more flexible of the two and is
designed to handle a wider variety of algorithms. The second
was developed about a year later and was designed
specifically for LPC algorithms with only minor changes.


The first processor to be covered is the Lincoln
Digital Voice Terminal (LDVT) which was designed and
constructed at the Lincoln Laboratory during the 1973-75
time frame. This processor is capable of carrying out 18
million basic instructions per second with a 16-bit by
16-bit multiplication taking four times as long. The
execution time for each instruction is 165 nsec. which
seems to conflict with the instruction rate. This is
resolved by the pipelining of the three portions of each
basic instruction: fetch, decode, and execute, so that an
instruction is completed every 55 nsec. The
processor has separate memories for data and the program.
The data memory capacity is 512 16-bit words and the
program memory contains 1024 16-bit instructions. The
pipeline instruction processing requires that the buses to
and from the ALU be separate and each is unidirectional.
Figure 11 shows the data paths of the LDVT (none of the
control or timing lines are shown). There are four active
registers: the P register which is the program counter with
multiplexed inputs from the address portion of the
instruction, the ALU, the sum of the X register and the
address portion of the instruction, and itself incremented
by one; the X register which is used for indexing memory
addresses; the A register which is the accumulator; and the
B register which is actually a pair of registers used for
input and output.
The ALU of the LDVT, as shown separately in figure 12,
has two sections: a standard programmable ALU which
performs logical, addition and compare operations; and a
16-bit by 16-bit multiplier array which provides a 32-bit
result in just 4 cycles. Either of these may be used with
any input, however due to their common input and output
only one may be used at a time.

It is significant to note some of the requirements
brought on by the pipelining of the instructions. The
device does not have a main bus over which data flows in
both directions. Generally all data flow is unidirectional
and in the case of the ALU, input buffer registers are
needed to hold the data for the instruction being executed
while the next instruction may have already read a value
from memory and put this on the ALU input lines.

In addition to LPC algorithms at 2400, 3600 and 4800 bits
per second, the LDVT has been programmed for adaptive
predictive coding at 3000 bits per second and as a channel
vocoder at 2400 bits per second.
FIGURE 12. LDVT ALU

The second speech processor is the Linear Predictive
Coding Microprocessor (LPCM) which is designed strictly as
a low cost LPC terminal. The basic cycle time for this
machine is 156 nsec. The data memory has 2K 16-bit words of
which 1.5K is ROM and 0.5K is RAM. The program memory
contains 1K of 48-bit words. The LPCM is almost free of
instruction decoding, with the only exception being the ALU
operation. Figure 13 shows the instruction format and in
figure 14 it is evident that parts of the instruction
register are being input as control functions. Figure 15 is
a block diagram of the LPCM and shows the two buses and the
large number of registers needed to control the data flow.

While these machines have varying degrees of
adaptability, it does not appear that either could handle
the additional computations described in the following
sections without major hardware modifications. However, a
special purpose LPC code converter which could be used in
conjunction with an existing terminal could probably be
developed which would operate in real-time and not load the
existing processor.
V. ADJUSTMENT OF VOCAL TRACT PARAMETERS USING LPC


One reference to voice characteristic modification was
found by the author [Atal and Hanauer, 1971]. Although
scaling of pitch, formant frequency and formant bandwidth
was stated to have been accomplished, no description of the
work was given. Other literature did provide useful
information on formant frequencies and pitch periods which
are typical for various speakers. It should be noted that
there is a considerably larger variation, from speaker to
speaker, in pitch period than in formant frequencies. As an
example, two speakers saying the same phoneme could easily
have pitch periods that varied by a factor of two, yet have
only a 10-20 per cent variation in formant frequencies.
Different physical structures (vocal cords and the vocal
tract) produce these speech characteristics (pitch period
and formant frequencies, respectively) and therefore their
variation from speaker to speaker is only partially
correlated.

The coded information produced from input voice by the
LPC processor is very closely related to the physical
structure that is producing the sound. On output, speech is
reconstructed from the gain, pitch period and
voiced/unvoiced parameters as well as the vocal tract
prediction coefficients. The gain and pitch period can be
varied as they stand but the variation of the prediction
coefficients is somewhat more complicated. The goal of
varying these coefficients before reconstruction is to have
the output voice have different pitch period and formant
frequencies while retaining a natural sound and retaining
the same information, i.e. the same sequence of phonemes
and voice inflection.

Voice characteristics are associated with certain
parameters of the LPC code. First, formant frequencies and
bandwidths are associated with the LPC coefficients. The
amplitude of the output voice is associated with both the
gain coefficient and the formant bandwidths. The
relationship between output amplitude and the formant
bandwidth is due to the increased energy in the impulse
response of a narrow bandwidth (high Q) transfer function.
This is noted physically by the fact that speakers with
highly resonant voices may speak louder for the same amount
of energy expended. The pitch period is controlled by the
pitch period coefficient only. Finally, the voiced/unvoiced
decision would normally not be changed. The exception
would be if one was reconstructing whispered speech (the
vocal cords are stationary) from normal speech.


A. ADJUSTMENT OF FORMANT FREQUENCY AND BANDWIDTH 

The vocal tract model we are using has all real 
coefficients in the z-domain polynomial. Following directly 
from this is the fact that all poles must fall either on 


the real axis of the z-plane or in complex conjugate pairs. 
Each of the complex conjugate pairs is associated with one
formant (resonator) of the speech model. The vocal tract
transfer function is the product of these resonator
transfer functions which are each of the following form

$$ H_f(z) = \frac{1}{1 - 2e^{-2\pi(BW)T_s}\cos(2\pi F T_s)\,z^{-1} + e^{-4\pi(BW)T_s}\,z^{-2}} $$

where F is the center frequency of the formant, f, BW
is the bandwidth of the formant and T_s is the sampling
period. The pole locations associated with this transfer
function are

$$ z = X \pm jY $$

This pair of poles must be moved in order to alter the 
frequency and bandwidth of this resonant section of the 
vocal tract model, but this must be done carefully so that 
the poles remain inside the z-plane unit circle. If the 
desired modification of the input speech is to reduce the 
bandwidth (increase Q) of the formants, the poles must be 
moved closer to the unit circle. If the distance from the 
center is multiplied by a constant factor, there is a
danger of moving poles outside the unit circle and thereby 
causing instability during reconstruction. However, the 
magnitude of the pole is always less than one and may be 
raised to any positive power without danger of crossing the 
unit circle. It is shown as follows that raising the 
magnitude to a factor is equivalent to multiplying the 
formant bandwidth by that same factor. 


The transfer function with the complex conjugate poles
above is:

$$ H_f(z) = \frac{1}{1 - 2X\,z^{-1} + (X^2 + Y^2)\,z^{-2}} $$

However with the pole locations in polar form

$$ X = A\cos\theta, \qquad Y = A\sin\theta $$

and making use of

$$ \cos^2\theta + \sin^2\theta = 1 $$

the equation becomes

$$ H_f(z) = \frac{1}{1 - 2A\cos\theta\,z^{-1} + A^2 z^{-2}} $$

Setting the terms of the characteristic equations equal we
get

$$ 2A\cos\theta = 2e^{-2\pi(BW)T_s}\cos(2\pi F T_s) $$

and

$$ A^2 = e^{-4\pi(BW)T_s} $$

which when solved for A and θ give

$$ A = e^{-2\pi(BW)T_s} $$

$$ \theta = 2\pi F T_s $$

and inversely

$$ F = \theta / 2\pi T_s $$

$$ BW = (-\ln A) / 2\pi T_s $$

If new formant characteristics, F' and BW', are desired
where

$$ F' = \beta F $$

and

$$ BW' = \alpha\,BW $$

they may be implemented by moving the poles of the
characteristic equation so that

$$ \theta' = \beta\,\theta $$

and

$$ \ln A' = \alpha \ln A $$

which reduces to

$$ A' = A^{\alpha} $$

This method of implementing the pole shifts guarantees
that no unstable poles will be created and is used in the
following section in the realization of an LPC voice
modification system.
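
The pole shift can be illustrated with a short sketch. It assumes the relations of this section, A = exp(-2π·BW·T_s) and θ = 2π·F·T_s; the function names, the 10 kHz sampling rate and the formant values are illustrative and are not taken from the thesis programs.

```python
import cmath
import math

def pole_from_formant(freq_hz, bw_hz, ts):
    """Upper-half-plane pole for a formant of center frequency freq_hz
    and bandwidth bw_hz, with sampling period ts (seconds)."""
    a = math.exp(-2.0 * math.pi * bw_hz * ts)
    theta = 2.0 * math.pi * freq_hz * ts
    return cmath.rect(a, theta)

def formant_from_pole(pole, ts):
    """Inverse relations: F = theta / (2 pi Ts), BW = -ln(A) / (2 pi Ts)."""
    a, theta = cmath.polar(pole)
    return theta / (2.0 * math.pi * ts), -math.log(a) / (2.0 * math.pi * ts)

def shift_pole(pole, freq_scale, bw_scale):
    """Multiply the angle by freq_scale and raise the magnitude to the
    power bw_scale. For 0 < A < 1 and bw_scale > 0 the new magnitude is
    also below one, so the pole cannot leave the unit circle."""
    a, theta = cmath.polar(pole)
    return cmath.rect(a ** bw_scale, theta * freq_scale)

ts = 1.0e-4                                  # assumed 10 kHz sampling
pole = pole_from_formant(500.0, 60.0, ts)    # hypothetical formant
moved = shift_pole(pole, 0.8, 0.5)
print(formant_from_pole(moved, ts))          # about (400 Hz, 30 Hz)
```

The conjugate pole is obtained by conjugating the result. The sketch does not guard against a scaled angle beyond π, which would wrap past the half sampling frequency.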


B. GAIN ADJUSTMENT

The filter coefficients reconstructed from the
relocated poles above may not have the same zero frequency
gain characteristic as the filter used for inverse
filtering during encoding. This situation can be
illustrated graphically by the two vocal tract transmission
characteristics shown in figure 16.


FIGURE 16. FORMANT GAIN, G(F): (A) BEFORE PROCESSING; (B) AFTER PROCESSING
Although the formant frequencies in 16(b) are lower than
the corresponding frequencies in 16(a) as was desired, the
overall gain was also changed. This would cause the
reconstructed speech to be much softer than desired.

A solution to this problem was to adjust the excitation
function gain used during reconstruction. This adjustment
factor would be equal to the ratio of the zero frequency
gains of the original and modified vocal tract filters. The
vocal tract has the following z-domain transfer function.


$$ H(z) = \frac{1}{1 + \sum_{i=1}^{p} a_i z^{-i}} $$

The above equation can be evaluated at

$$ z = e^{j2\pi f/f_s} $$

to obtain the gain at frequency f. Evaluating the above
transfer function at f=0 yields the following equations.

$$ z = e^{j0} = 1 $$

and

$$ H(1) = \frac{1}{1 + \sum_{i=1}^{p} a_i} $$


This equation can be easily evaluated for both the
coefficients of the vocal tract transfer function
calculated from the input sequence and the coefficients
calculated from the altered pole locations. The gain
multiplication factor is then multiplied by the energy
measured in the error signal to get the excitation gain to
be used during reconstruction.
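
The zero frequency gain ratio is simple to compute from the two coefficient sets. The sketch below assumes the all-pole form H(z) = 1 / (1 + Σ a_i z^-i) with a_0 = 1 implied; the function names and the second-order coefficient values are hypothetical.

```python
def dc_gain(a):
    """Zero-frequency gain of H(z) = 1 / (1 + sum_i a_i z^-i).
    `a` holds a_1..a_p; a_0 = 1 is implied."""
    return 1.0 / (1.0 + sum(a))

def gain_correction(a_original, a_modified):
    """Ratio of the original to the modified zero-frequency gain.
    The measured excitation gain is multiplied by this factor."""
    return dc_gain(a_original) / dc_gain(a_modified)

# Hypothetical second-order filters before and after pole relocation.
a_in = [-1.2, 0.72]
a_out = [-1.0, 0.5]
print(gain_correction(a_in, a_out))
```

A filter with a pole at z = 1 would make the denominator zero; the sketch assumes stable filters with all poles strictly inside the unit circle.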


C. PITCH PERIOD ADJUSTMENT

The adjustment of the measured pitch period may almost
go without explanation except to note that if the pitch
period is increased and all other coefficients remain
unchanged, the output speech would be softer. This is due
to the reduced energy (impulses less often) being input to
the vocal tract filter and the resulting lower energy in
the output speech.
VI. COMPUTER SIMULATION OF PITCH AND FORMANT MODIFICATION


The process of pitch and formant modification was 
carried out on the IBM 360 computer with the input and 
output being accomplished on a hybrid system consisting of 
a COMCOR 5000 analog computer and an XDS 9300 digital 
computer. The interface between the XDS 9300 and the IBM
360 was seven track digital magnetic tape. All work was
done on five second segments to allow sufficient length for 
analysis while not using excessive computer processing 


time. 


A. VOICE INPUT AND DIGITAL SAMPLING 

The input voice was recorded on a standard single track
audio tape recorder at 7 1/2 inches per second (ips).
Recording was done with a high quality microphone in a
quiet but not sound-proof room. The digitizing was done at
half speed to allow the digital computer to write the data
onto tape without missing any data. The recording was
played back at 3 3/4 ips with the output directed to an
amplifier of the analog computer. The voice was amplified
to a level appropriate for the analog computer (a ±100 volt
machine). The amplifier output was passed through two
fourth-order analog filters set at 2550 Hz and 2400 Hz cut
off frequencies. The output of the filters was then put


into a sample and hold circuit at the input of a 14-bit
analog to digital converter. The 14 bits produced were
read by the XDS 9300 and placed in the most significant
bits of the 24 bit XDS 9300 computer word. This process is
illustrated in figure 17.


FIGURE 17. DATA ACQUISITION (low pass filter, sample and hold, analog to digital converter, 24 bit word)

The sampling rate used was 5000 Hz. However the voice
recording was played back at half speed and therefore the
equivalent lowpass filter cut off and the equivalent
sampling rate were about 4750 and 10,000 Hz respectively.


B. XDS 9300 OPERATION

The operation of the XDS 9300 during the input phase
was simply to read the data available at the output of the
analog to digital converter and place this data in an
array. When an array of 1024 samples was filled it was
written onto a seven track magnetic tape. This was done
continuously so that no data was lost between blocks. The
voice segment as it existed on the seven track tape
consisted of 50 blocks of 1024 samples. Each sample was
recorded in an integer format ranging from +8388607 to
-8388607 (±((2**23)-1)). This tape was then used as the input
to the IBM 360.


C. IBM 360 INPUT PREPARATION 

When the 24-bit word, seven track tape created by the
XDS 9300 was read by the IBM 360, the machine
representation of the values was not correct. This was due
to the addition of the eight bits shown in figure 18.

FIGURE 18. 24-BIT XDS 9300 WORD AND CORRECTED IBM 360 WORD








The data conversion program (Appendix A.1) was used to read
the data from the seven track tape and move the bits of
each value as required. The program did not make the
conversion from ones complement representation (XDS 9300)
to twos complement representation (IBM 360) because any
error caused would be well below the 14-bit quantization
error. At this point the data was converted to floating
point representation with values between ±100.0 and the
average value of each sequence was calculated and
subtracted from each data point. This insured that the
input was a zero mean function. Each data sequence was
written into a separate file of a standard nine track IBM
360 tape for ease of further handling.
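
The scaling and mean removal described above amount to a few lines. This is a sketch under stated assumptions, not the Appendix A.1 program: the mapping of the 24-bit full scale (2**23 - 1) onto 100.0 is an assumption, and `prepare_input` is a hypothetical name.

```python
FULL_SCALE = 2 ** 23 - 1   # largest magnitude of a 24-bit signed word

def prepare_input(raw_words):
    """Scale integer samples to the range +/-100.0 and subtract the
    mean so that the resulting sequence is zero mean."""
    scaled = [100.0 * w / FULL_SCALE for w in raw_words]
    mean = sum(scaled) / len(scaled)
    return [v - mean for v in scaled]

# Hypothetical raw samples with a constant offset of 1000 counts.
samples = prepare_input([1000 + d for d in (-500, 0, 500, 0)])
print(sum(samples))   # close to zero
```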


D. SCOPE OF SIMULATION PROGRAM 

The goal of this research was to demonstrate the
feasibility of voice modification and as a result only
certain areas were studied. Specifically, all programming
was done with the standard IBM 360 floating-point
arithmetic, making no allowance for the effects which would
be caused by the shorter word length and integer
representation used in most voice processing systems.
Further study of that area is warranted and would be
especially critical in the determination of the pole
location, which is covered later.

The system degradation by background noise in the input
speech was not studied except to note that the
voiced/unvoiced decision threshold would need to be adjusted
for a noise environment.

Although the programs were written to allow variation
in the order of the prediction, number of samples per frame
and sampling interval, these were not varied. A 12th order
vocal tract filter was used throughout and proved to be
satisfactory. The analysis frame length was 25.6 msec.
(256 samples) and also remained unchanged. In any future
use of these programs with a different frame length,
attention would be required by the input format to insure
that the analysis frame length is an integral multiple of
the input record length.

Finally, in the following description of the programs 
the term 'LPC coefficients' will refer to the coefficients
of the vocal tract model filter. The term 'LPC parameters'
will refer to the entire set of parameters needed to 
reconstruct the output speech, i.e. the LPC parameters 
consist of the LPC coefficients, the gain parameter, the 


pitch period and the voicing indicator. 


E. LPC ENCODING

The first step of the encoding process was to determine
the filter coefficients. These coefficients were used in
the inverse filter for determination of the error signal.
The root mean square values of the input and error signals
were compared to determine if the frame was voiced or
unvoiced. Finally the pitch period was determined for
voiced frames. This program is listed in Appendix A.2.

1. LPC Coefficient Determination

Determination of the LPC coefficients was done with
the autocorrelation method in the subroutine named AUTO.
First, the input data, s(n), was windowed by one of four
available windows producing a temporary array, t(n), of the
windowed data.

$$ t(n) = w(n) \times s(n) $$

The discrete autocorrelation of the temporary array was
calculated for the discrete displacements of zero to the
predictor order, p.
$$ R(j) = \sum_{n=0}^{N-1-j} t(n)\,t(n+j), \qquad 0 \le j \le p $$

where N is the number of samples in the frame. The next
step was the solution of the following matrix equation.

$$ \sum_{j=1}^{p} a_j\,R(|i-j|) = R(i), \qquad 1 \le i \le p $$

Because the autocorrelation matrix is always positive
definite and symmetric, with all values along a given
diagonal equal, a particularly efficient method of solution
is available. This method is attributed to Durbin
[Makhoul, 1975] and is implemented in subroutine COEFF.
Durbin's algorithm is recursive and calculates the
predictor coefficients for the kth order from the
coefficients for the (k-1)th order. The ith coefficient for
the kth order predictor is a_i(k). The recursion formulas
follow.

$$ E(0) = R(0) $$

$$ a_k(k) = \left[ R(k) - \sum_{i=1}^{k-1} a_i(k-1)\,R(k-i) \right] / E(k-1) $$

$$ a_i(k) = a_i(k-1) - a_k(k)\,a_{k-i}(k-1), \qquad 1 \le i \le k-1 $$

$$ E(k) = \left(1 - a_k(k)^2\right) E(k-1) $$

E(k) is the prediction error resulting from limiting
the predictor order to k.
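
The autocorrelation and Durbin recursion steps can be sketched as follows. This is an illustrative reimplementation in Python, not the AUTO and COEFF subroutines of Appendix A.2; the function names and the short test sequence are assumptions. The coefficients returned follow the predictor convention, i.e. before any sign change applied for use in the characteristic polynomial.

```python
def autocorrelation(t, p):
    """R(j) = sum_n t(n) t(n+j) for displacements j = 0..p."""
    n = len(t)
    return [sum(t[i] * t[i + j] for i in range(n - j))
            for j in range(p + 1)]

def durbin(r, p):
    """Solve sum_j a_j R(|i-j|) = R(i), i = 1..p, by Durbin's recursion.
    Returns ([a_1, ..., a_p], E(p))."""
    a = [0.0] * (p + 1)          # a[0] is unused
    e = r[0]                     # E(0) = R(0)
    for k in range(1, p + 1):
        acc = r[k] - sum(a[i] * r[k - i] for i in range(1, k))
        ak = acc / e             # a_k(k)
        prev = a[:]              # coefficients of order k-1
        a[k] = ak
        for i in range(1, k):
            a[i] = prev[i] - ak * prev[k - i]
        e *= 1.0 - ak * ak       # E(k) = (1 - a_k(k)^2) E(k-1)
    return a[1:], e

# Hypothetical windowed data and a third-order predictor.
t = [1.0, 0.5, -0.3, 0.8, 0.2, -0.6, 0.1, 0.4]
r = autocorrelation(t, 3)
coeffs, err = durbin(r, 3)
print(coeffs, err)
```

Keeping a copy of the order k-1 coefficients at each step is also what makes the lower order solutions available at no extra cost, should they be wanted later.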

During the programming of COEFF the subroutine TEST 
was written to perform and print the results of the matrix 
multiplication. During the initial testing of the program 
various window functions were used in AUTO, however the
prediction error did not change significantly with
the window function used.

Certain researchers have noted that a lower order
filter may be used during unvoiced speech. If this is
desired, the coefficients for the lower order filters could
be stored during the recursive steps of the algorithm above
and later, when the frame is determined to be unvoiced, the
lower order filter coefficients would be available without
further calculation.

The coefficients, a_i, used in the main program are
the coefficients of the characteristic polynomial of the
filter with a_0 assumed to be unity. Therefore the
negative of the values calculated in COEFF
were returned to the main program.
2. Error Signal Determination

The error signal, e(n), is determined by
subtracting the predicted sample value, ŝ(n), from the
actual value, s(n).

$$ e(n) = s(n) - \hat{s}(n) $$

$$ \hat{s}(n) = -\sum_{i=1}^{p} a_i\,s(n-i) $$

$$ e(n) = s(n) + \sum_{i=1}^{p} a_i\,s(n-i) $$

This operation is carried out by subroutine ERR. In order
to make a correct error determination at the beginning of
each frame, a number of samples equal to the order of the
predictor were saved from the end of the previous frame.
This eliminated additional error signal energy caused by
poor beginning of frame prediction and reduced the
possibility of an incorrect voicing decision. Another
possible solution to this problem would be just not
analyzing the error for the first few samples of each frame
and making the appropriate changes in the following
routines that use the error signal.
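
The inverse filtering with samples carried over from the previous frame can be sketched as below. The function name, argument layout and first-order test values are assumptions for illustration; the coefficients are those of the characteristic polynomial (a_0 = 1 implied), so the error is s(n) plus the weighted past samples.

```python
def error_signal(frame, a, history):
    """Inverse filter one frame: e(n) = s(n) + sum_i a_i s(n-i).

    frame   -- samples of the current frame
    a       -- [a_1, ..., a_p], with p >= 1
    history -- last p samples of the previous frame, oldest first
    Returns (error, new_history) so frames can be chained.
    """
    p = len(a)
    s = list(history) + list(frame)
    err = [s[n] + sum(a[i] * s[n - 1 - i] for i in range(p))
           for n in range(p, len(s))]
    return err, s[-p:]

# Hypothetical first-order case: s(n) = 0.9 s(n-1) is predicted exactly
# when a_1 = -0.9, so only the first sample leaves an error.
a = [-0.9]
e1, hist = error_signal([1.0, 0.9, 0.81], a, [0.0])
e2, hist = error_signal([0.729, 0.6561], a, hist)
print(e1, e2)
```

Because the history is passed in, the first sample of the second frame is predicted from the last sample of the first frame, and no start-of-frame error appears.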
3. Voicing Decision 

A comparison of input signal energy and the error
signal energy was used to determine if a particular frame
was voiced or unvoiced. Although the root mean square value
of each set of data is actually proportional to the square
root of the energy in the signal, the root mean square
value was used in this comparison. Whenever the root mean
square value of the input signal divided by the root mean
square value of the error signal was greater than a
threshold value, the frame was determined to be voiced and
the voicing indicator was set to one. Otherwise the voicing
indicator was set to zero.
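
A minimal sketch of this decision rule follows. The threshold of 2.0 and the handling of an all-zero error signal are assumptions; the text states only that the threshold was determined empirically.

```python
import math

def rms(x):
    """Root-mean-square value of a sequence."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def voicing_indicator(speech, error, threshold):
    """Return 1 (voiced) when rms(speech) / rms(error) exceeds the
    threshold, otherwise 0 (unvoiced)."""
    e = rms(error)
    if e == 0.0:
        # Assumption: a frame with no error energy is voiced only if
        # it contains any signal at all (silence counts as unvoiced).
        return 1 if rms(speech) > 0.0 else 0
    return 1 if rms(speech) / e > threshold else 0

# Hypothetical frames: ratio 4 is voiced, ratio 4/3 is unvoiced.
speech = [4.0, -4.0, 4.0, -4.0]
print(voicing_indicator(speech, [1.0, -1.0, 1.0, -1.0], 2.0))
print(voicing_indicator(speech, [3.0, -3.0, 3.0, -3.0], 2.0))
```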
4. Pitch Period Determination 

The error signal was used in subroutine PITCH for 
determination of the pitch period of each voiced frame. 
First the error signal was passed through a recursive 5th
order Butterworth filter with an 800 Hz cut off, to smooth
the signal. Extra samples of the error signal and filtered
error signal were saved from frame to frame (zeroed during
unvoiced frames) to insure a correct filtered error signal
at the beginning of each frame. The degradation of the
system if this was not done was negligible but plots of the
filtered error signal would have shown discontinuities at
the beginning of each frame. The
frame was windowed to eliminate end effects and the
autocorrelation function of the filtered error signal was
calculated. The portion of the autocorrelation function
from 12 to 180 samples was searched for peak values and the
pitch period set equal to the location of the peak chosen.
Figure 19 shows a typical autocorrelation function and the
portion of the curve searched for the peak value. The peak
picking algorithm checked to insure that the value chosen
was not on the downslope of the center peak and was not a
minor peak with a larger peak at a longer pitch period.


FIGURE 19. AUTOCORRELATION FUNCTION, R(j), SHOWING THE REGION SEARCHED


Although this pitch determination algorithm worked
satisfactorily in this program it is probably not as
accurate and flexible as certain other, more complicated
techniques available. It was used only for pitch periods
from about 5 to 9 msec., but was satisfactory for them.
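
The window, autocorrelation and peak search steps can be sketched as follows. This is a simplified illustration, not subroutine PITCH: the Butterworth low-pass stage is omitted, a Hamming window is assumed, and the peak picker simply takes the largest local maximum in the search range, which rejects points on the downslope of the center peak.

```python
import math

def pitch_period(error, min_lag=12, max_lag=180):
    """Estimate the pitch period in samples from an error signal frame.
    The frame must be longer than max_lag + 1 samples.
    Returns None when no positive peak is found in the search range."""
    n = len(error)
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1))
              for i in range(n)]
    x = [w * e for w, e in zip(window, error)]
    # Autocorrelation for lags 0 .. max_lag + 1
    r = [sum(x[i] * x[i + j] for i in range(n - j))
         for j in range(max_lag + 2)]
    best_lag, best_val = None, 0.0
    for j in range(min_lag, max_lag + 1):
        is_peak = r[j] > r[j - 1] and r[j] >= r[j + 1]
        if is_peak and r[j] > best_val:
            best_lag, best_val = j, r[j]
    return best_lag

# Hypothetical 256-sample frame holding impulses every 70 samples.
frame = [1.0 if n % 70 == 0 else 0.0 for n in range(256)]
print(pitch_period(frame))
```

With the 10,000 Hz equivalent sampling rate, the default search range of 12 to 180 samples corresponds to pitch periods of 1.2 to 18 msec.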


F. LPC PARAMETER MODIFICATION 

The purpose of the program was to demonstrate the
modification of voice characteristics. The system was
designed so that only the LPC parameters were needed to
make the desired modifications. No other measurements of
the input speech are needed. Of the parameters calculated
from the input speech, only the voicing indicator remained
unchanged. The LPC coefficients are varied as required by
the desired formant frequency and bandwidth changes.
The pitch period is varied separately and the gain
is adjusted to correct for changes caused by formant
bandwidth modification.
1. LPC Coefficient Modification

The modification of the LPC coefficients is
accomplished by three subroutines: POLES, ALT, and NEWCF.
Subroutine POLES calculates the z-plane pole locations from
the LPC coefficients. Subroutine ALT changes the locations
of the poles according to the various scale factors
specified by the main program. The new predictor
coefficients are calculated by subroutine NEWCF.

The LPC coefficients, a_i, are provided to
subroutine POLES to get the pth order z-domain polynomial
which is factored into its component roots, the z-plane
poles of the vocal tract filter. This factorization is
done with library routine ZRPOLY which was sufficiently
accurate and produced complex conjugate pairs which were
exact complex conjugates. This simplified the problem,
which came up later, of separating the real poles and the
complex conjugate pairs so that the proper scaling factor
could be applied to each. The input polynomial had all
real coefficients and therefore all the roots are real or
in complex conjugate pairs. These poles are placed in a
complex array and returned to the main program.

The subroutine ALT was provided with the complex
array of pole locations and it separated them into separate
arrays of real and complex poles. Each complex conjugate
pole pair was entered as one entry in the complex pole
array. The scaling factors provided to subroutine ALT
consisted of:

(1) FSC - Formant frequency scaling factor

(2) BSC - Formant bandwidth scaling factor

(3) RSC - Real pole scaling factor

(4) RLIM - Real pole magnitude limit

(5) SP - Sampling period

The polar coordinates were determined for each pair
of complex conjugate poles and the magnitude, A, and angle,
θ, of each were considered separately. The magnitude was
raised to the power of the bandwidth scale factor and the
angle was multiplied by the frequency scale factor.

$$ A' = A^{BSC} $$

$$ \theta' = FSC \times \theta $$

The modified magnitude, A', and angle, θ', were used to
determine the complex location and the calculated pole and
its conjugate were put in the pole vector for output.
During the alteration process each complex pair of poles
was checked against a constant magnitude of 0.98 to insure
that numerical instability or repeated impulses would not
cause excessively large outputs.

Each real pole was multiplied by the real pole
scale factor and checked to insure that the magnitude was
less than the limit prescribed. The effects of varying the
real poles were not studied and a real pole limit of 0.95
proved to guarantee sufficient damping of the output to
provide a nearly zero mean output.
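
The pole alteration described above might be sketched as follows. This is not subroutine ALT: the parameter names mirror FSC, BSC, RSC and RLIM, but clamping a pole to its limit is an assumption, since the text says only that the magnitudes were checked. The sampling period is not needed here because the angle is scaled directly.

```python
import cmath

def alter_poles(poles, fsc, bsc, rsc, rlim, pair_limit=0.98):
    """Scale real poles and complex conjugate pole pairs.

    Relies on conjugate pairs being exact conjugates: only the
    upper-half-plane member of each pair is processed and its
    conjugate is regenerated.
    """
    out = []
    for z in poles:
        if z.imag == 0.0:
            # Real pole: multiply by RSC, then clamp to +/-RLIM (assumed).
            x = max(-rlim, min(rlim, z.real * rsc))
            out.append(complex(x, 0.0))
        elif z.imag > 0.0:
            a, theta = cmath.polar(z)
            a_new = min(a ** bsc, pair_limit)      # A' = A**BSC, clamped
            moved = cmath.rect(a_new, theta * fsc)  # theta' = FSC * theta
            out.extend([moved, moved.conjugate()])
    return out

# Hypothetical pole set: one conjugate pair and one real pole.
pair = cmath.rect(0.9, 0.5)
new_poles = alter_poles([pair, pair.conjugate(), complex(0.99, 0.0)],
                        fsc=0.8, bsc=0.5, rsc=1.0, rlim=0.95)
print(new_poles)
```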

The poles from both the real and complex pole 
arrays were combined into one array for return to the main 
program. Subroutine ALT also provided graphical and 
printed output of the pole locations, before and after 
modification when this was desired. Figure 20 is an example
of the graphical output which shows the z-plane pole
locations before and after modification, in relation to the
unit circle.





FIGURE 20. VOCAL TRACT POLES (X - INPUT; + - AFTER MODIFICATION)


Subroutine NEWCF performed the task of multiplying
the poles to calculate the coefficients of the modified
characteristic equation for the vocal tract filter. This
operation was done in double precision arithmetic because
the predictor coefficients being calculated often differed
by only small amounts. This process would require close
study before this system could be implemented on a short
word length processor.
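
Multiplying the poles back into a polynomial can be sketched as repeated multiplication by first-order factors (1 - z_k z^-1). The function name is hypothetical and this is not the NEWCF listing; Python complex numbers use double precision, which matches the precision requirement noted above.

```python
def coefficients_from_poles(poles):
    """Expand the product of (1 - z_k z^-1) over all poles into
    [1, a_1, ..., a_p]. For real poles and exact conjugate pairs the
    imaginary parts cancel, so only the real parts are returned."""
    coeffs = [complex(1.0, 0.0)]
    for zk in poles:
        nxt = coeffs + [0j]
        for i in range(1, len(nxt)):
            nxt[i] -= zk * coeffs[i - 1]
        coeffs = nxt
    return [c.real for c in coeffs]

# Hypothetical pair 0.5 +/- 0.5j gives 1 - z^-1 + 0.5 z^-2.
print(coefficients_from_poles([complex(0.5, 0.5), complex(0.5, -0.5)]))
```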
2. Pitch Period Modification 

The pitch period was modified in the main program 
and consisted only of converting the pitch period (an 
integer) to floating point representation, multiplying by 
the pitch period scale factor, and reconverting to fixed 
point representation. Although changing the pitch period is 
relatively simple, a number of other changes are caused by 
modifying the pitch period. If the pitch period is 
shortened the gain must be reduced to make up for the 
increased energy being input to the vocal tract filter. 
The relationship between the pitch period and the formant 
bandwidth also requires further study. It appears that the 
formant bandwidths (Q's of the vocal tract resonators) 
should produce an impulse response which is significantly
attenuated by the time the next impulse is input to the 
filter. There is most likely a feedback effect between the 
vocal tract resonators and the vocal cords vibration rate 
which is not considered by the model used. This effect is 
noted in the graphical output as sharp discontinuities at 


the point where each new impulse is generated. 
3. Gain Adjustment 

Although overall gain of the system can be adjusted 
easily at the output, the relative amplitude from frame to 
frame must be retained during the processing. The gain 
coefficient, the root mean square of the error function, is 
adjusted to account for the change in the energy of the 
vocal tract impulse response brought about by the bandwidth 
changes. As was described earlier, the ratio of the original 
and modified vocal tract filter gains at zero frequency is 
used to estimate the ratio of impulse response energy. 
Although this is not strictly true, as long as the scaling 
factors are limited to those which produce realistic 
speech sounds, this appears to work very well. The zero 
frequency gain of the original vocal tract filter, G(in), 
is calculated before the LPC coefficients are modified. 

$$ G(in) = \frac{1}{\sum_{i=0}^{p} a_i} $$

The values of both a_0 and a'_0 are unity. After the 
coefficients are modified the same calculation is performed 
again. 

$$ G(out) = \frac{1}{\sum_{i=0}^{p} a'_i} $$

The root mean square of the error signal, rms(E), is 
multiplied by the ratio to obtain the new gain coefficient, 
rms(E'). 

$$ rms(E') = rms(E) \times G(in) / G(out) $$
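
The gain adjustment can be illustrated with a short Python sketch. It assumes the all-pole filter form used later in this chapter, so that the zero frequency gain is the reciprocal of the sum of the coefficients with a_0 = 1. The function names are hypothetical and the coefficient values are invented for the example.

```python
def zero_frequency_gain(coeffs):
    """Gain of the all-pole filter at z = 1.

    coeffs holds a_0 .. a_p with a_0 == 1 (assumed convention).
    """
    total = sum(coeffs)
    if total == 0.0:
        raise ValueError("filter has a pole at zero frequency")
    return 1.0 / total


def adjust_gain(rms_error, coeffs_in, coeffs_out):
    """Scale rms(E) by G(in) / G(out) to obtain rms(E')."""
    g_in = zero_frequency_gain(coeffs_in)
    g_out = zero_frequency_gain(coeffs_out)
    return rms_error * g_in / g_out


# Invented first-order example: the modified filter has twice the
# zero frequency gain, so the excitation gain is halved.
new_rms = adjust_gain(1.0, [1.0, -0.5], [1.0, -0.75])
```

When the modification raises the filter gain, the ratio falls below one and the excitation is reduced, which keeps the relative amplitude from frame to frame.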


SPEECH RECONSTRUCTION 

Reconstruction of the sampled speech waveform from the 
modified LPC parameters is accomplished by subroutine 
RECON. This routine not only decodes both voiced and 
unvoiced speech, but also makes allowance for the 
transition of varying parameters from frame to frame. The 
LPC parameters from the previous frame are saved between 
calls to subroutine COEFF and are used during the current 
frame when needed. It is also necessary to save output 
values from the previous frame to allow the recursive 
calculation of the output values at the beginning of the 
current frame. 
1. Unvoiced Speech 

During continuous unvoiced speech (as opposed to 
the previous frame being voiced) the new LPC parameters are 
used immediately upon entry to subroutine RECON. The 
excitation function is determined by calling a library 
routine, GGNOF, which returns normally distributed random 
numbers with zero mean and a variance of unity, and 
multiplying the value returned by the gain parameter. The 
excitation function is changed for every output sample to 
simulate the continuous excitation caused by turbulent air 
in the vocal tract. The vocal tract filter is implemented 
by the recursive addition of past values of the output to 
the excitation function. The z-domain transfer function 

$$ \frac{S(z)}{E(z)} = \frac{1}{1 + \sum_{i=1}^{p} a_i z^{-i}} $$

is implemented with the discrete time function 

$$ s(n) = e(n) - \sum_{i=1}^{p} a_i s(n-i) $$

where s(n) is the output sample and e(n) is the excitation 
function. 
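
The unvoiced reconstruction can be sketched as follows. This is an illustrative Python version and not subroutine RECON: the function names are hypothetical, and Python's random.Random.gauss stands in for the GGNOF library routine. The history argument holds the last p output samples of the previous frame, oldest first.

```python
import random


def all_pole_filter(excitation, a, history):
    """Compute s(n) = e(n) - sum_{i=1..p} a_i * s(n-i).

    a       : [a_1, ..., a_p]
    history : the p most recent outputs before this frame, oldest first
    """
    p = len(a)
    if len(history) != p:
        raise ValueError("history must hold exactly p samples")
    out = list(history)  # out[-1] is s(n-1)
    for e in excitation:
        s = e - sum(a[i] * out[-1 - i] for i in range(p))
        out.append(s)
    return out[p:]  # drop the saved history, keep the new samples


def unvoiced_frame(a, gain, n_samples, history, rng):
    """Excite the filter with gain-scaled, unit-variance Gaussian noise."""
    excitation = [gain * rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    return all_pole_filter(excitation, a, history)


# A fixed seed makes the noise repeatable for testing.
frame = unvoiced_frame([-0.5], 2.0, 8, [0.0], random.Random(1))
```

A new noise value is drawn for every output sample, which corresponds to the continuous excitation described above.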
2. Voiced Speech 

During voiced speech a certain amount of continuity 
must be maintained from frame to frame. This was 
accomplished by allowing any uncompleted pulses from the 
previous frame to finish before the parameters are changed. 
Immediately upon entering the subroutine during voiced 
speech, the pulse period counter is tested to see if it is 
equal to the former pulse period. If the former pulse is 
not complete, the routine goes ahead and recursively 
calculates the output values. Upon completion of a pulse 
from a former frame or any pulse during the current frame, 
the new LPC parameters are used to replace the old ones. 
There was a direct replacement for all parameters except 
the gain coefficient. The geometric mean of the old and new 
gain coefficients is used for the gain on the current pulse 
and the old gain is replaced with the gain just calculated. 
This allows the difference between the old and new 
gain parameters to decay exponentially, which prevents sharp 
changes in amplitude from frame to frame and makes the 
output speech more natural. 
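
The effect of the geometric mean on the gain can be seen in a small Python sketch. The function names and the gain values are invented for illustration; only the geometric mean update itself is taken from the text.

```python
import math


def smooth_gain(old_gain, new_gain):
    """Geometric mean of the previous pulse gain and the new frame gain."""
    return math.sqrt(old_gain * new_gain)


def gain_track(start_gain, frame_gain, n_pulses):
    """Gain used on each successive pulse while the frame gain stays fixed."""
    gain = start_gain
    track = []
    for _ in range(n_pulses):
        gain = smooth_gain(gain, frame_gain)  # old gain replaced by the result
        track.append(gain)
    return track


# Invented values: a step in frame gain from 1 to 16.
track = gain_track(1.0, 16.0, 4)
```

Each update halves the difference between the logarithms of the old and new gains, so a step in the frame gain is approached over several pulses rather than applied at once.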
3. Transition Frames 

If the current frame and the previous frame were 
not of the same type, care must be taken to ensure that all 
parameters are changed together. If LPC coefficients for 
unvoiced speech were used with a pulsed output, an unnatural 
sound would be likely to be produced. During the 
transition from unvoiced to voiced speech, the retained 
values from the previous frame are normally small in 
comparison to the amplitude of the pulsed excitation 
function. Therefore the voiced speech production may begin 
immediately. For the opposite transition, the large 
amplitude samples near the beginning of an output pulse are 
significantly larger than the unvoiced excitation values. 
Therefore whenever unvoiced speech follows a voiced frame, 
the previous output pulse is allowed to finish. The 
damping that occurs during the voiced pulse normally 
reduces the magnitude of the samples near the end of the 
pulse to the point where they will not interfere with the 
unvoiced speech to follow. 
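
The rule for when the new parameters may be adopted can be written as a single test. The Python sketch below is illustrative only: the function and argument names are hypothetical, and the exact comparison used for a completed pulse is an assumption.

```python
def may_switch_parameters(prev_voiced: bool,
                          samples_into_pulse: int,
                          prev_pitch_period: int) -> bool:
    """Decide whether the new frame's parameters can be adopted now.

    After an unvoiced frame the switch is immediate, whatever the type
    of the new frame. After a voiced frame the pulse in progress must
    first run for a full pitch period of the previous frame.
    """
    if not prev_voiced:
        return True
    # Assumption: the pulse is complete once the counter reaches the period.
    return samples_into_pulse >= prev_pitch_period
```

Because the test depends only on the previous frame, the same check covers voiced to voiced and voiced to unvoiced transitions.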


OUTPUT PROCESSING 

The reconstructed speech samples are output onto a 
standard nine track IBM 360 magnetic tape. These values 
were later input to a data conversion program (Appendix 
A) which converted the floating point values to integers 
which were in the proper format for the XDS 9300 and within 
an appropriate range for the XDS 9300's digital to analog 
converter. The necessity of using a seven track tape for 
data transfer still existed, so the significant bits of the 
integers had to be shifted into the proper position so that 
none of the eight bits dropped during the writing of each 
value onto the seven track tape would affect the data. 

This tape was input to the XDS 9300 which, via the digital 
to analog converter, made the samples available on the 
COMCOR 5000 in analog form. 

These samples were output at a rate of 5000 per second 
through a sample and hold circuit. Again two low pass filters 
were used to remove the time quantization noise from the 
samples. The analog waveform was recorded at 3 3/4 ips on a 
standard tape recorder which could be played at 7 1/2 ips 
to hear the reconstructed speech. 
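
The playback arrangement implies an effective sampling rate, which can be checked with a short calculation. The Python sketch below is illustrative; the function name is hypothetical. The frame figures (15 frames in 384 msec.) are those quoted for the vowel example elsewhere in this thesis, and the resulting samples-per-frame value is an inference, not a number stated in the text.

```python
def effective_sample_rate(output_rate_hz, record_ips, playback_ips):
    """Sample rate heard when a tape is replayed at a different speed."""
    return output_rate_hz * playback_ips / record_ips


# 5000 samples per second recorded at 3 3/4 ips, replayed at 7 1/2 ips.
rate_hz = effective_sample_rate(5000.0, 3.75, 7.5)

# 15 frames spanning 384 msec. gives the frame length in seconds.
frame_seconds = 0.384 / 15
samples_per_frame = rate_hz * frame_seconds  # close to 256
```

Doubling the tape speed on playback doubles the effective rate, which allows the digital to analog conversion to run at half of real time.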


GRAPHICAL OUTPUT 

The programs described above were also able to produce 
a variety of graphical outputs to assist the researcher in 
following the signals through the LPC processing. The 
waveforms available from these programs are: 

(1) Input speech 

(2) Error signal before filtering 

(3) Error signal after filtering 

(4) Reconstructed output speech 

The z-plane pole locations determine the formant 
frequencies and bandwidths and were also available for 
graphical display. A separate program (Appendix A.3) was 
written to display the logarithmic power spectral density 
of the input and output speech for a number of consecutive 
frames and proved useful in analysis of the output quality. 
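
The relation between a z-plane pole and the corresponding formant can be sketched with the standard conversion from pole angle and radius. The Python fragment below is illustrative and is not taken from the thesis programs: the function name is hypothetical, and the 10000 Hz sampling rate used in the example is an assumption.

```python
import cmath
import math


def pole_to_formant(pole: complex, sample_rate_hz: float):
    """Return (frequency_hz, bandwidth_hz) for one pole inside the unit circle.

    frequency = angle * fs / (2 * pi)
    bandwidth = -ln(radius) * fs / pi
    """
    radius = abs(pole)
    if not 0.0 < radius < 1.0:
        raise ValueError("pole must lie strictly inside the unit circle")
    frequency_hz = abs(cmath.phase(pole)) * sample_rate_hz / (2.0 * math.pi)
    bandwidth_hz = -math.log(radius) * sample_rate_hz / math.pi
    return frequency_hz, bandwidth_hz


# Invented pole at radius 0.9 and angle pi/4, assumed 10000 Hz sampling.
freq_hz, bw_hz = pole_to_formant(cmath.rect(0.9, math.pi / 4.0), 10000.0)
```

Moving a pole along a circle of constant radius changes the formant frequency while leaving the bandwidth unchanged, and moving it toward the unit circle narrows the bandwidth.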
VII. RESULTS 


The desired result of this study was the reconstruction 
of speech at pitch and formant frequencies different from 
those of the input speech. The complete process of 
encoding, modification and decoding was accomplished for 
three 5-second segments of speech. Upon completion of the 
process most listeners agreed that although the input 
speech was female, the modified output speech sounded 
typically male. Although the audio output was somewhat 
lacking in quality it was intelligible. 

Examples of the printed and graphical computer output 
are given in Appendix B. Two examples are completely 
covered. The first 384 msec. segment (15 frames) is of the 
vowel 'e' and the second segment is of the transition from 
a fricative to a voiced sound, 'sa', from the beginning of 
the word salt. Both were derived from a recording of a 
female speaker and were reconstructed first without 
modification and then with modifications which consisted 
of a reduction of the pitch frequency by a factor of 0.58 and 
reduction of the formant frequencies by a factor of 0.88. 
First the input waveform with the logarithmic power 
spectral density plot of that portion of the speech is 
given. Examples of the printed processing summary are next 
and are followed by the waveforms of the error signal and 
the filtered error signal. Plots of the vocal tract pole 
locations are shown with the poles at input superimposed on 
the poles after modification. Finally, speech waveforms for 
both unmodified and modified output with their respective 
logarithmic power spectral density functions are displayed. 
The audio output is available from the author on request, 
in the form of an audio tape recording. This tape 
recording is described in detail in Appendix C. 

The results above demonstrate the feasibility of the 
use of linear predictive coding as a technique for voice 
modification. This research also indicated areas in which 
further study and improvement may be made. Some of these 
areas are: 

(1) The effect of noise during voiced speech on the 
prediction error and on the gain calculated from 
the error. It may be possible to use only the 
energy occurring at the peaks of the error signal 
and thereby attribute the remainder of the error 
signal to noise. 

(2) The effect of the use of different window 
functions in the autocorrelation function calculation 
and how this variation affects pitch period 
determination and the voicing threshold. 

(3) The possibility of constructing an LPC 
processing system with asynchronous clocks for the 
frame timer and the output sample generation. This 
would produce a very similar effect to that 
accomplished here, but probably at a reduced cost. 
VIII. CONCLUSIONS 


With the refinement and standardization of LPC 
communication processors, the ratio of processing time to 
real time for unaltered communication is expected to drop 
below the current 65%. The available computation time may 
be used for the pitch and formant alteration described 
above or for other modifications which can be accomplished 
at either the transmitting or receiving processor and still 
allow real time voice communications. 

A number of possible applications of the speech 
frequency characteristic modification described are: 


(1) A digital hearing aid for persons (such as the 
author) with high frequency hearing loss. 

(2) Radios in military vehicles which would produce 
speech in a frequency range different than the 
range of the predominant noise in the vehicle, i.e. 
low pitched voice in turbine aircraft with high 
frequency noise and high pitched voice for 
helicopters and tanks where low frequency noise is 
most prevalent. 

(3) Voice channel jammers which would produce 
random phonemes with pitch and formant 
characteristics similar to the current users of the 
channel. 

As LPC communications systems become common because of 
their low data rate requirements, the use of LPC 
parameter modification will be desired to extend the 
flexibility of voice communication and storage systems. 
Frequency modification is one viable process available. 
APPENDIX A.1 SEVEN TRACK TO NINE TRACK TAPE CONVERSION 
PROGRAM 

      SUBROUTINE RECON(A,IP,RMS,IVF,IPP,N,S)
C
C     RECONSTRUCT SPEECH SAMPLES FROM LPC COEFF, ETC
C     A   = VECTOR OF LPC COEFF
C     IP  = NUMBER OF COEFF (ORDER OF FILTER)
C     RMS = RMS VALUE OF ERROR SIGNAL
C     IVF = 0 UNVOICED
C         = 1 VOICED
C     IPP = PITCH PERIOD IN NUMBER OF SAMPLES
C     N   = SAMPLES PER FRAME
C     S   = SAMPLE VECTOR (OUTPUT)
C     NOTE - RMSO, AO, IVFO, IPPO, NP, EX AND XX ARE ASSUMED TO
C     RETAIN THEIR VALUES BETWEEN CALLS (STATIC STORAGE)
      DIMENSION A(1),S(1),X(270),XX(14),AO(14)
      DATA XX,ISEED,IVFO/14*0.0,71244,0/
C     RESTORE LAST IP OUTPUT VALUES OF THE PREVIOUS FRAME
      DO 10 I = 1,IP
      X(I) = XX(I)
   10 CONTINUE
      NIP = N+IP
      NS = 1+IP
C     IF CURRENT PULSE UNFINISHED DON'T CHANGE COEFF YET
      IF(IVFO.NE.0) GO TO 400
C     NEW COEFF
C     GAIN IS GEOMETRIC MEAN OF OLD AND NEW (VOICED ONLY)
  100 RMSO = SQRT(RMSO*RMS)
      IF (IVF.EQ.0) RMSO=RMS
      IF (RMSO.LT.(RMS/2.0)) RMSO=RMS/2.0
      DO 105 I = 1,IP
      AO(I) = A(I)
  105 CONTINUE
      IVFO = IVF
      IPPO = IPP
C     TEST IF VOICED
      IF(IVFO.NE.0) GO TO 300
C     RECONSTRUCT UNVOICED SPEECH
  200 E = RMSO*GGNOF(ISEED)
      DO 210 I = 1,IP
      NSMI = NS-I
      E = E-A(I)*X(NSMI)
  210 CONTINUE
      X(NS) = E
      IF (NS.GE.NIP) GO TO 600
      NS = NS+1
      GO TO 200
C     START VOICED PULSE
  300 NP = 1
      EX = RMSO*SQRT(FLOAT(IPPO))
C     TEST FOR BEGINNING OF PULSE PERIOD
  400 IF (NP.GT.IPPO) GO TO 100
      E = 0.0
      IF (NP.EQ.1) E = EX
C     RECONSTRUCT VOICED SPEECH USING THE SAVED COEFF (AO) SO
C     THAT AN UNFINISHED PULSE KEEPS THE OLD PARAMETERS
  500 DO 510 I = 1,IP
      NSMI = NS-I
      E = E-AO(I)*X(NSMI)
  510 CONTINUE
      NP = NP+1
      X(NS) = E
      IF (NS.GE.NIP) GO TO 600
      NS = NS+1
      GO TO 400
C
C     SAVE VALUES AND PREPARE OUTPUT
  600 DO 610 I = 1,IP
      XX(I) = X(N+I)
  610 CONTINUE
      DO 620 I = 1,N
      S(I) = X(I+IP)
  620 CONTINUE
      RETURN
      END
      SUBROUTINE RMS(X,N,VAL)
C     DETERMINE THE RMS VALUE OF A SET OF DATA
C     X   = VECTOR OF INPUT SAMPLES
C     N   = NUMBER OF SAMPLES
C     VAL = RMS VALUE RETURNED
      DIMENSION X(1)
      VAL = 0.0
      DO 10 I = 1,N
      VAL = VAL+X(I)**2
   10 CONTINUE
      VAL = SQRT(VAL/FLOAT(N))
      RETURN
      END
      SUBROUTINE WINDW(X,Y,N,IWIN)
C     X    = VECTOR OF UNWINDOWED SAMPLES
C     Y    = VECTOR OF WINDOWED SAMPLES (OUTPUT)
C     N    = NUMBER OF SAMPLES
C     IWIN = TYPE OF WINDOW
C            0 = RECTANGULAR (COPY ONLY)
C            1 = HAMMING (ALPHA = 0.54)
C            2 = BARTLETT
C            3 = BLACKMAN
C            4 = HANNING
      DIMENSION X(1),Y(1)
      DATA PI,TWOPI,FORPI/3.1415926,6.2831853,12.566372/
      IF(IWIN.LT.0.OR.IWIN.GT.4) GO TO 999
      AN = FLOAT(N)
C     IWIN = 0 FALLS THROUGH THE COMPUTED GO TO
      GO TO (110,210,310,410),IWIN
C     RECTANGULAR WINDOW COPY VECTOR
   10 DO 20 I=1,N
      Y(I) = X(I)
   20 CONTINUE
      RETURN
C     HAMMING WINDOW
  110 DO 120 I=1,N
      AJ = FLOAT(I-1)
      Y(I) = X(I)*(0.54-0.46*COS(TWOPI*AJ/(AN-1.0)))
  120 CONTINUE
      RETURN
C     BARTLETT WINDOW
  210 NN = N/2
      NNN = NN+1
      DO 220 I=1,NN
      AJ = FLOAT(I-1)
      Y(I) = X(I)*2.0*AJ/(AN-1.0)
  220 CONTINUE
      DO 230 I=NNN,N
      AJ = FLOAT(I-1)
      Y(I) = X(I)*2.0*(1.0-AJ/(AN-1.0))
  230 CONTINUE
      RETURN
C     BLACKMAN WINDOW
  310 DO 320 I=1,N
      AJ = FLOAT(I-1)
      Y(I) = X(I)*(0.42-0.5*COS(TWOPI*AJ/(AN-1.0))
     X       +0.08*COS(FORPI*AJ/(AN-1.0)))
  320 CONTINUE
      RETURN
C     HANNING WINDOW
  410 DO 420 I=1,N
      AJ = FLOAT(I-1)
      Y(I) = X(I)*0.5*(1.0-COS(TWOPI*AJ/(AN-1.0)))
  420 CONTINUE
      RETURN
  999 WRITE (6,998)
  998 FORMAT(//10X,'** ERROR SUBR WINDOW **'//)
      RETURN
      END
APPENDIX B.1 COMPUTER ANALYSIS AND MODIFICATION OF VOICED 
SPEECH 


The 15 frame (384 msec.) segment of speech analyzed in 
this appendix is the 'long e' sound (as in need) and is 
spoken by a woman. The process illustrated shows both 
direct reconstruction and reconstruction with the pitch 
reduced by a factor of 0.58 and the formant frequencies 

Figure B.1.3(b) 

Processing Summary of Frame 4 


Figure B.1.3(c) 

this appendix is the 'sa' sound (beginning of salt) and is 
spoken by a woman. The process illustrated shows both 

Processing Summary of Frame 5 


Figure B.2.3(d) 

Processing Summary of Frame 6 


Figure B.2.3(e) 

Figure B.2.4 WAVEFORM OF ERROR SIGNAL 


Figure B.2.5 WAVEFORM OF FILTERED ERROR SIGNAL 

Figure B.2.8 WAVEFORM OF MODIFIED OUTPUT SPEECH 
APPENDIX C. DESCRIPTION OF VOICE TAPE 


The audio recording which is available from the author has 
four sections, each of which contains three segments of 
speech. These three speech segments are of the following 
sounds: 

Segment 1 - Five long vowels. 

Segment 2 - Four words which are combinations of 
fricatives and voiced sounds. 


TSaretnee li padone 
Segment 3 - A sentence with a variety of sounds. 
'Every salt breeze comes from the sea.' 

Each of these segments is repeated in each section of the 
tape. Each section of the tape shows the effects of a 
different step in the processing. 

Section 1 - Unprocessed speech, the recording used 
for input to the processing system. 

Section 2 - Speech which has been converted to 
digital form and then converted back to analog 
form with no other processing. 

Section 3 - Speech which has been encoded into a 
set of LPC parameters and then decoded using the 
same parameters (i.e. no modification). 

Section 4 - Speech which has been encoded into a 
set of LPC parameters and those parameters altered 
to reduce the pitch frequency by a factor of 0.58 
and to reduce the formant frequencies by a factor 
of 0.88. The same LPC decoding process is then 
used to reconstruct the speech segment. 